
ABSTRACT

AN IMPROVED PROGRAM METHOD IS SHOWN FOR DETERMINING THE SIMILARITY OF TWO CHARACTER STRINGS WHERE THEY ARE NOT NECESSARILY EQUAL CHARACTER BY CHARACTER. EACH CHARACTER OF ONE CHARACTER STRING IS COMPARED WITH EACH CHARACTER OF THE OTHER CHARACTER STRING, AND THE RESULTS ARE STORED IN A MATRIX. EACH ROW OF THE MATRIX CORRESPONDS TO A RESPECTIVE ONE OF THE CHARACTERS OF THE FIRST STRING, AND EACH COLUMN OF THE MATRIX CORRESPONDS TO A RESPECTIVE CHARACTER OF THE OTHER STRING. A LOGICAL ONE VALUE IS STORED IN EACH MATRIX POSITION WHERE CORRESPONDING CHARACTERS OF THE FIRST AND SECOND STRING ARE EQUAL, AND A LOGICAL ZERO IS STORED IN EACH POSITION WHERE THEY ARE NOT EQUAL. IN THE EVENT THAT THERE IS MORE THAN ONE LOGICAL ONE IN EACH ROW (OR IN EACH COLUMN, OR BOTH), ALL OF THE LOGICAL ONES IN THAT ROW (OR COLUMN, OR BOTH) EXCEPT THAT WHICH IS CLOSEST TO THE MAJOR DIAGONAL OF THE MATRIX ARE CHANGED TO LOGICAL ZERO. IN THE EVENT THAT TWO LOGICAL ONES IN A ROW (OR COLUMN, OR BOTH) ARE EQUIDISTANT FROM THE MAJOR DIAGONAL, BOTH MAY BE RETAINED. THE REMAINING LOGICAL ONES IN THE MATRIX ARE THEN CONSIDERED AS POINTS ON AN X, Y COORDINATE SYSTEM, AND THE STANDARD CORRELATION COEFFICIENT (WHICH MEASURES LINEAR DEPENDENCE) OF THE POINTS IS CALCULATED TO DETERMINE THE DEGREE OF SIMILARITY BETWEEN THE TWO CHARACTER STRINGS. A DECISION THAT SUFFICIENT SIMILARITY EXISTS TO MAKE A DETERMINATION THAT A MATCH HAS OCCURRED MAY BE BASED UPON VARIOUS KNOWN TESTS, FOR EXAMPLE, EXCEEDING PREDETERMINED THRESHOLDS; OR, IN THE EVENT THAT A MATCH IS MADE AGAINST SEVERAL CHARACTER STRINGS, THE HIGHEST COEFFICIENT CALCULATED CAN BE DETERMINED AS THE MATCH. THE IMPROVED METHOD CAN BE USED IN PROGRAMMING AREAS SUCH AS INQUIRY (OR QUERY) SYSTEMS AND THE JOB CONTROL LANGUAGE PROCESSOR OF AN OPERATING SYSTEM. THE PROGRAM USUALLY HANDLES MISSING CHARACTERS, TRANSPOSED CHARACTERS AND OTHER COMMON ERRORS. THE PROGRAM CAN BE EITHER A FIXED OR AN ADAPTIVE TYPE.
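For reference, the "standard correlation coefficient" named in the abstract can be written out as follows, assuming it refers to the usual Pearson product-moment coefficient. The symbols n, x_k and y_k are introduced here for illustration and do not appear in the publication: n is the number of logical ones retained in the matrix, and (x_k, y_k) is the column and row index of the k-th retained one.

```latex
r = \frac{n \sum_{k=1}^{n} x_k y_k - \sum_{k=1}^{n} x_k \sum_{k=1}^{n} y_k}
         {\sqrt{n \sum_{k=1}^{n} x_k^2 - \left(\sum_{k=1}^{n} x_k\right)^2}\;
          \sqrt{n \sum_{k=1}^{n} y_k^2 - \left(\sum_{k=1}^{n} y_k\right)^2}}
```

Under this reading, two identical strings with no repeated characters place every retained one on the major diagonal, so r = 1. The coefficient is undefined when fewer than two points remain or when all points share the same row or the same column.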

DEFENSIVE PUBLICATION UNITED STATES PATENT OFFICE Published at the request of the applicant or owner in accordance with the Notice of Dec. 16, 1969, 869 O.G. 687. The abstracts of Defensive Publication applications are identified by distinctly numbered series and are arranged chronologically. The heading of each abstract indicates the number of pages of specification, including claims and sheets of drawings contained in the application as originally filed. The files of these applications are available to the public for inspection and reproduction may be purchased for 30 cents a sheet.

Defensive Publication applications have not been examined as to the merits of alleged invention. The Patent Office makes no assertion as to the novelty of the disclosed subject matter.

PUBLISHED MAY 7, 1974 922 O.G. 10

T922,008 DATA PROCESSOR PROGRAM FOR DETERMINING CHARACTER STRING SIMILARITY Donald C. Gause, Owego, and Eduardo Kellerman and Gerald L. Rouse, Endicott, N.Y., assignors to International Business Machines Corporation, Armonk, N.Y. Continuation of application Ser. No. 154,093, June 17, 1971. This application Oct. 18, 1973, Ser. No. 407,733

Int. Cl. G05b 19/28; G06f 7/34 U.S. Cl. 444-1 4 Sheets Drawing. 27 Pages Specification

An improved program method is shown for determining the similarity of two character strings where they are not necessarily equal character by character. Each character of one character string is compared with each character of the other character string, and the results are stored in a matrix. Each row of the matrix corresponds to a respective one of the characters of the first string, and each column of the matrix corresponds to a respective character of the other string. A logical one value is stored in each matrix position where corresponding characters of the first and second string are equal, and a logical zero is stored in each position where they are not equal. In the event that there is more than one logical one in each row (or in each column, or both), all of the logical ones in that row (or column, or both) except that which is closest to the major diagonal of the matrix are changed to logical zero. In the event that two logical ones in a row (or column, or both) are equidistant from the major diagonal, both may be retained. The remaining logical ones in the matrix are then considered as points on an X, Y coordinate system, and the standard correlation coefficient (which measures linear dependence) of the points is calculated to determine the degree of similarity between the two character strings. A decision that sufficient similarity exists to make a determination that a match has occurred may be based upon various known tests, for example, exceeding predetermined thresholds; or, in the event that a match is made against several character strings, the highest coefficient calculated can be determined as the match. The improved method can be used in programming areas such as inquiry (or query) systems and the job control language processor of an operating system. The program usually handles missing characters, transposed characters and other common errors. The program can be either a fixed or an adaptive type.

STEP 1 FORM A MATRIX, M, BY LETTING M(I;J) = 1 (WHERE I INDICATES THE ROW AND J INDICATES THE COLUMN) IF, AND ONLY IF, A(I) = B(J); OTHERWISE, LET M(I;J) = 0.

STEP 2 IF A ROW OF M CONTAINS MORE THAN ONE 1, THEN RETAIN ONLY THE ONE CLOSEST TO THE MAIN DIAGONAL (THE MAIN DIAGONAL IS M(1;1), M(2;2), ...).

STEP 3 IF A COLUMN OF M CONTAINS MORE THAN ONE 1, THEN RETAIN ONLY THE ONE CLOSEST TO THE MAIN DIAGONAL (THE MAIN DIAGONAL IS M(1;1), M(2;2), ...).

STEP 4 CONSIDER THE 1'S IN M AS POINTS ON AN X-Y COORDINATE SYSTEM. THAT IS, IF M(I;J) IS 1 THEN WE HAVE A POINT WITH THE Y-COORDINATE EQUAL TO I AND THE X-COORDINATE EQUAL TO J.

STEP 5 DETERMINE THE DEGREE OF SIMILARITY BY COMPUTING THE STANDARD CORRELATION COEFFICIENT, WHICH MEASURES LINEAR DEPENDENCY OF THE POINTS.
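The five steps above can be sketched in Python as follows. This is an illustrative reading, not the program disclosed in the application: the function names are hypothetical, distance from the main diagonal is assumed to be |I - J|, the correlation coefficient is assumed to be the Pearson coefficient, and returning 0.0 when the coefficient is undefined (fewer than two points, or zero variance) is a choice made here rather than something the publication states.

```python
from math import sqrt


def match_matrix(a, b):
    """Step 1: M[i][j] = 1 iff a[i] == b[j] (rows follow a, columns follow b)."""
    return [[1 if ca == cb else 0 for cb in b] for ca in a]


def keep_closest_in_rows(m):
    """Step 2: in each row holding several 1s, keep those nearest the diagonal.

    Distance is taken as |i - j| (an assumption); equidistant 1s are all kept.
    """
    result = []
    for i, row in enumerate(m):
        ones = [j for j, v in enumerate(row) if v]
        if len(ones) > 1:
            nearest = min(abs(i - j) for j in ones)
            row = [1 if v and abs(i - j) == nearest else 0
                   for j, v in enumerate(row)]
        result.append(list(row))
    return result


def transpose(m):
    """Swap rows and columns so the row filter can be reused for columns."""
    return [list(col) for col in zip(*m)]


def correlation(points):
    """Step 5: Pearson correlation of (x, y) points; 0.0 when undefined."""
    n = len(points)
    if n < 2:
        return 0.0
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    den2 = (n * sxx - sx * sx) * (n * syy - sy * sy)
    if den2 <= 0:
        return 0.0
    return (n * sxy - sx * sy) / sqrt(den2)


def similarity(a, b):
    """Degree of similarity of strings a and b following steps 1 to 5."""
    m = keep_closest_in_rows(match_matrix(a, b))       # steps 1 and 2
    m = transpose(keep_closest_in_rows(transpose(m)))  # step 3 (|i - j| is symmetric)
    points = [(j + 1, i + 1)                           # step 4: x = J, y = I
              for i, row in enumerate(m)
              for j, v in enumerate(row) if v]
    return correlation(points)
```

With this sketch, similarity("SMITH", "SMYTH") evaluates to 1.0 because the four matching characters all lie on the main diagonal, while the transposition in similarity("SMITH", "SMTIH") moves two points off the diagonal and lowers the score to 0.9.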

FIG. 2a FIG. 2b FIG. 2c

FIG. 1

EDUARDO KELLERMAN, GERALD L. ROUSE, DONALD C. GAUSE

May 7, 1974 D. C. GAUSE ET AL T922,008

DATA PROCESSOR PROGRAM FOR DETERMINING CHARACTER STRING SIMILARITY Original Filed June 17, 1971 4 Sheets-Sheet 2

FIG. 3 (legible block labels): PROCESSOR (CPU); I/O DEVICES; TERMINAL; CALCULATED CORRELATION VALUE; THRESHOLD TABLE; SCORE; COORDINATE VECTOR X; COORDINATE VECTOR Y; MATRIX; MAX; REC. entries

FIG. 4 (legible box labels): READ INTO BUF1; ALL DONE; CARRIAGE RETURN; BUF2, BUF1; WAS A MATCH FOUND; PRINT THE ANSWER ASSOCIATED WITH QUESTION IMAX; PRINT "WAS THAT RIGHT?"; READ INTO IN; INCREASE THRESHOLD FOR QUESTION IMAX; ADD NEW QUESTION AND NEW ANSWER TO LIST

FIG. 5 (legible box labels):

ENTER FROM STEP 35

44 IS I GREATER THAN NUMBER OF QUESTIONS? (YES / NO)

45 SCORE[I] = 'DEGREE OF SIMILARITY' OF BUF2 AND Q[I;]

47 PLACE A ZERO IN THOSE POSITIONS OF SCORE WHERE THE VALUE OF SCORE IS LESS THAN THE THRESHOLD OF THE CORRESPONDING QUESTION.

48 LET MAX BE THE HIGHEST VALUE IN SCORE

SET ZERO IN POSITION IN SCORE DEFINED BY IMAX

NO MATCH: TO STEP 56, FIG. 4

TO STEP 28, FIG. 4
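The thresholding and selection boxes of FIG. 5 can be sketched in Python as below. This is a hedged illustration only: the function name best_match and its parameters are hypothetical, the scores are assumed to have been computed already (one per stored question), and treating a maximum of zero as "no match" is an assumption, since the corresponding decision box is not legible in the source.

```python
def best_match(scores, thresholds):
    """Return the index of the best-scoring question, or None for no match.

    scores[i] is the degree of similarity between the input and question i;
    thresholds[i] is the per-question threshold. Both are plain lists here.
    """
    # Place a zero wherever the score is less than the question's threshold.
    masked = [s if s >= t else 0.0 for s, t in zip(scores, thresholds)]
    if not masked:
        return None
    # Let MAX be the highest remaining value, and IMAX its position.
    imax = max(range(len(masked)), key=masked.__getitem__)
    if masked[imax] <= 0:
        return None  # assumed meaning of the "no match" exit
    return imax
```

In an adaptive arrangement of the kind FIG. 4 suggests, a caller could raise thresholds[imax] after the user rejects an answer, so that the same question is less likely to be selected for similar input later.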

